Updates zarr-parser to use obstore list_async instead of concurrent_map #892

norlandrhagen wants to merge 21 commits into main

Conversation
virtualizarr/parsers/zarr.py (Outdated)

```python
lengths = await _concurrent_map(
    [(k,) for k in chunk_keys], zarr_array.store.getsize
)
lengths = [size_map[k] for k in chunk_keys]
```
I think we really want to work hard to avoid creating any python lists / dicts at all
instead we want obstore -> arrow -> numpy
via https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
I think the hardest part of this is dealing with the logic for missing keys - arrow might return these as nulls, but the to_numpy conversion doesn't support nulls?
Any operations we do should either be as pyarrow arrays or as numpy arrays, never as python collections
virtualizarr/parsers/zarr.py (Outdated)

```python
stream = zarr_array.store.store.list_async(prefix=prefix, return_arrow=True)
async for batch in stream:
    size_map.update(
        zip(batch.column("path").to_pylist(), batch.column("size").to_pylist())
    )
```
is this zipping of pylists creating a python dict? we want to avoid that
You will also want to add a new (private for now) constructor to the …
Hmm, now hitting a kerchunk error: …
virtualizarr/manifests/manifest.py (Outdated)

```python
def _from_arrow(
    cls,
    *,
    chunk_keys: "pa.Array",
```
I don't know that you need to pass this - maybe instead we should pass arrow arrays with nulls for uninitialized chunks?
virtualizarr/parsers/zarr.py (Outdated)

```python
path_batches = []
size_batches = []
stream = zarr_array.store.store.list_async(prefix=prefix, return_arrow=True)
```
Just grabbing the underlying obstore store is an interesting idea...
Co-authored-by: Tom Nicholas <tom@earthmover.io>
This should be unit testable without using Kerchunk or Icechunk. We are simply creating the …
…ape]. Moves all weird arrow reshaping into zarr:build_chunk_manifest
Totally agree! I think... the kerchunk errors are unrelated. I added …
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #892      +/-   ##
==========================================
- Coverage   89.33%   89.11%   -0.23%
==========================================
  Files          34       33       -1
  Lines        1997     2030      +33
==========================================
+ Hits         1784     1809      +25
- Misses        213      221       +8
```
virtualizarr/manifests/manifest.py (Outdated)

```python
    pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
).to_numpy(zero_copy_only=False)

if shape is not None:
```
What happens if shape is None? Should that even be allowed?
virtualizarr/manifests/manifest.py (Outdated)

```python
paths_np = (
    pc.if_else(pc.is_null(paths), "", paths)
    .to_numpy(zero_copy_only=False)
    .astype(np.dtypes.StringDType())
)
offsets_np = pc.if_else(
    pc.is_null(offsets), pa.scalar(0, pa.uint64()), offsets
).to_numpy(zero_copy_only=False)
lengths_np = pc.if_else(
    pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
).to_numpy(zero_copy_only=False)
```
Let's split the Arrow compute operations from the numpy conversions, if only because it makes it easier to read.
virtualizarr/parsers/zarr.py (Outdated)

```python
chunk_grid_shape = tuple(
    math.ceil(s / c) for s, c in zip(zarr_array.shape, zarr_array.chunks)
)
# scalar arrays go through the dict path instead of the pure arrow bit
```
It would be nice to not have to keep the whole old codepath around just for this special case...
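For reference, the chunk-grid arithmetic above is just a per-dimension ceiling division, and the scalar special case falls out of the empty shape. A minimal standalone sketch, not the PR's code:

```python
import math

def chunk_grid_shape(shape: tuple, chunks: tuple) -> tuple:
    # Number of chunks along each dimension: ceil(array_size / chunk_size).
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))

# A (10, 10) array with (4, 4) chunks has a 3 x 3 grid of chunks.
# A scalar (zero-dimensional) array has shape (), so the grid is also (),
# which is why scalars fall outside the generic arrow codepath.
```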
virtualizarr/parsers/zarr.py (Outdated)

```python
    return ChunkManifest(chunk_map)
normalized_keys, full_paths, all_lengths = result

# Incoming: lots of LLM arrow mumbo jumbo for sparse arrays
```
there's a lot going on here that I'm suspicious could be simplified
Totally agree. I took a shot at trying to simplify it a bit. The handling of sparse arrays makes it a bit verbose.
```python
    flat_positions,
    pc.multiply(pc.cast(dim_indices, pa.int64()), dim_stride),
)
split_keys = pc.split_pattern(normalized_keys, pattern=".")
```
The chunk key encoding could also be "/" - we can probably read that from the zarr.json and use it here?
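In Zarr v3 metadata the separator lives under chunk_key_encoding. A hedged sketch of reading it from a parsed zarr.json; the inline document below is fabricated for illustration:

```python
import json

# Example zarr.json fragment; the v3 spec puts the separator at
# chunk_key_encoding.configuration.separator ("/" or ".").
zarr_json = json.loads(
    '{"zarr_format": 3, "chunk_key_encoding":'
    ' {"name": "default", "configuration": {"separator": "/"}}}'
)

separator = (
    zarr_json.get("chunk_key_encoding", {})
    .get("configuration", {})
    .get("separator", "/")  # "/" is the spec default for the "default" encoding
)
```

The result could then be passed to pc.split_pattern(normalized_keys, pattern=separator) in place of the hard-coded ".".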
Closes #891: Speed up ZarrParser using obstore and Arrow?

- Tests passing
- Full type hint coverage
- Changes are documented in docs/releases.rst

Swaps out the _concurrent_map in build_chunk_mapping with obstore's list_async. Constructs the python ChunkManifest object's numpy arrays directly from the Arrow arrays.*

*There is still a conversion to a dict, so not quite. Bonus: removes the zarr vendor code.